%%HTML
<script src="require.js"></script>
from IPython.display import display, HTML
HTML(
"""
<script
src='https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js'>
</script>
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
} else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action='javascript:code_toggle()'>
<input type="submit" value='Click here to toggle on/off the raw code.'>
</form>
"""
)
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from collections import Counter
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
# Set helper functions
def plot_3d(df_new, y_predict=None):
"""
Create a 3D scatter plot using Plotly.
Parameters
----------
df_new : DataFrame or array-like
Input data for the 3D scatter plot.
y_predict : array-like, optional
Array of values used for coloring the markers.
"""
fig = go.Figure(data=[go.Scatter3d(
x=df_new[:, 0],
y=df_new[:, 1],
z=df_new[:, 2],
mode='markers',
marker=dict(size=5,
color=y_predict,
opacity=0.8))])
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0),
scene=dict(
xaxis_title='PCA 1',
yaxis_title='PCA 2',
zaxis_title='PCA 3'))
fig.show()
def plot_dendrogram(Z):
"""
Plot a truncated dendrogram using Ward's Method.
Parameters
----------
Z : array-like
The linkage matrix representing the hierarchical clustering.
Returns
-------
matplotlib.axes._axes.Axes
The matplotlib axes containing the dendrogram.
"""
fig, ax = plt.subplots()
dn = dendrogram(Z, ax=ax, p=5, truncate_mode='level')
ax.set_xlabel(r'Datapoints')
ax.set_ylabel(r'$\Delta$')
ax.set_title("Figure 4. Dendrogram for Ward's Method")
return ax
Abstract
The study delves into the critical task of identifying distinct customer segments within a business's consumer base and understanding their key characteristics, purchase behaviors, and preferences. By employing hierarchical clustering with Ward's method on a dataset encompassing diverse aspects such as people, products, promotion, and place, the analysis reveals two main clusters, "Affluent Traditionalists" and "Digital Economizers," with the latter further divided into two sub-clusters, "Budget-Focused Families" and "Flexible Consumers," each presenting unique characteristics and behaviors.
Recommendations are tailored for targeted marketing and engagement efforts for each cluster. For the Affluent Traditionalists, the focus should be on premium offerings, personalized services, and enhancing the in-store experience. On the other hand, Digital Economizers require strategies revolving around digital engagement, value-based promotions, and accommodation of the needs of budget-conscious families. The study outlines limitations, such as potential data diversity constraints and the absence of qualitative insights, urging future research to address these gaps. Additionally, advanced methodologies like predictive analytics, machine learning models, and qualitative research methods are recommended to enhance segmentation accuracy and provide deeper insights into customer behaviors and preferences. Integrating external factors and mapping customer journeys can further refine the understanding of market dynamics. Feature engineering and utilizing various clustering methods and preprocessing techniques are also suggested to broaden the scope of analysis.
In conclusion, the study not only identifies customer segments but also provides strategic recommendations, outlines limitations, and suggests advanced methodologies for future research, offering a comprehensive roadmap for businesses aiming to tailor their strategies and stay relevant in a dynamic market landscape.
Problem Statement
One of the first and most important questions every business must ask itself is who it believes its target market is, followed closely by what value it is able to offer that target market. By identifying who its customers are, the business is able to work towards attracting, catering to, and retaining them.
However, more often than not, a company's consumers are not limited to a single customer profile. Most customer bases consist of several distinct groups, each sharing characteristics, purchase behaviors and product preferences. The business can take advantage of these differences to enhance its marketing, sales and operations strategies and tailor the customer experience of each valued segment.
With this in mind, the business should ask themselves the following:
What are the different segments of a company’s customers, and what are their key characteristics that differentiate them from each other?
The report aims to answer this problem, as well as give further insight on the next possible steps the business can take upon identification of the customer segments.
Motivation
There are several reasons why identifying the different segments of a business's customer base is important. Firstly, it allows the company to conduct targeted marketing and sales efforts depending on its ideal customer, designing both products and campaigns that target customers would most likely engage with. A focused marketing strategy gives the company the ability to optimize how it spends its budget and resources to reach that specific niche. Additionally, understanding the preferences of several customer segments allows the company more flexibility in terms of resource allocation. Customer experience, and in the long term retention, can also be improved with knowledge of customer profiles, as understanding their behaviors and preferences can help the company design programs aimed at keeping them engaged and content.
The challenge is that the consumer base is constantly evolving. A business's customers today may look very different from a few years ago, or from how they will look in the future. A business that strives to understand its customers and their several segments can cater to this diverse set of needs. Overall, the aim is to provide the company with data that can be leveraged for informed decision-making when it comes to keeping customers satisfied, giving the company a competitive advantage.
Data Source
The data used for the study is available via the Asian Institute of Management (AIM) Jojie public datasets under the directory /mnt/data/public/customer-personality-analysis/marketing_campaign.csv and was loaded via pandas. The data can also be found on Kaggle (Patel, 2021). As described on the website, the dataset covers four main categories: people, products, promotion and place.
Each category type pertains to the following:
- People: Identifiers of each customer, such as their ID, date of birth and other personal features relating to the customer themselves.
- Products: Relates to the amount of money spent on differing products such as fruit or meat.
- Promotion: Refers to customer behavior towards marketing campaigns and discounts.
- Place: Number of purchases a customer makes via different channels, such as in the store or via the web.
Data Exploration
Displayed in Table 1. is a snapshot of the Customer Personality dataset, loaded via pandas from the csv file. The raw data contains 2240 rows, each representing a customer, and 29 columns representing the different customer features.
Table 1. Snapshot of Customer Personality Analysis Dataset
df = pd.read_csv('/mnt/data/public/customer-personality-analysis/'
'marketing_campaign.csv', sep='\t')
df.columns = df.columns.str.lower()
display(df.head())
print(f'Data dimensions: {df.shape}')
| id | year_birth | education | marital_status | income | kidhome | teenhome | dt_customer | recency | mntwines | ... | numwebvisitsmonth | acceptedcmp3 | acceptedcmp4 | acceptedcmp5 | acceptedcmp1 | acceptedcmp2 | complain | z_costcontact | z_revenue | response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
5 rows × 29 columns
Data dimensions: (2240, 29)
Data Overview
The name, data type, description and category of each feature are detailed under Table 2., using the descriptions and categories provided in Kaggle as reference (Patel, 2021).
Table 2. Description of Customer Personality Analysis Features
| Feature Name | Type | Description | Category |
|---|---|---|---|
| id | integer | Customer's unique identifier | people |
| year_birth | integer | Customer's birth year | people |
| education | object | Customer's education level | people |
| marital_status | object | Customer's marital status | people |
| income | float | Customer's yearly household income | people |
| kidhome | integer | Number of children in customer's household | people |
| teenhome | integer | Number of teenagers in customer's household | people |
| dt_customer | object | Date of customer's enrollment with the company | people |
| recency | integer | Number of days since customer's last purchase | people |
| mntwines | integer | Amount spent on wine in last 2 years | products |
| mntfruits | integer | Amount spent on fruits in last 2 years | products |
| mntmeatproducts | integer | Amount spent on meat in last 2 years | products |
| mntfishproducts | integer | Amount spent on fish in last 2 years | products |
| mntsweetproducts | integer | Amount spent on sweets in last 2 years | products |
| mntgoldprods | integer | Amount spent on gold in last 2 years | products |
| numdealspurchases | integer | Number of purchases made with a discount | promotion |
| numwebpurchases | integer | Number of purchases made through the company’s website | place |
| numcatalogpurchases | integer | Number of purchases made using a catalogue | place |
| numstorepurchases | integer | Number of purchases made directly in stores | place |
| numwebvisitsmonth | integer | Number of visits to company’s website in the last month | place |
| acceptedcmp3 | integer | 1 if customer accepted the offer in the 3rd campaign, 0 otherwise | promotion |
| acceptedcmp4 | integer | 1 if customer accepted the offer in the 4th campaign, 0 otherwise | promotion |
| acceptedcmp5 | integer | 1 if customer accepted the offer in the 5th campaign, 0 otherwise | promotion |
| acceptedcmp1 | integer | 1 if customer accepted the offer in the 1st campaign, 0 otherwise | promotion |
| acceptedcmp2 | integer | 1 if customer accepted the offer in the 2nd campaign, 0 otherwise | promotion |
| complain | integer | 1 if the customer complained in the last 2 years, 0 otherwise | people |
| z_costcontact | integer | Unknown variable | others |
| z_revenue | integer | Unknown variable | others |
| response | integer | 1 if customer accepted the offer in the last campaign, 0 otherwise | people |
Assumptions:
- Amounts are assumed to be in US dollars (USD).
- For products-category features, the amount spent on products is assumed to be the monthly average for the past 2 years.
- For place-category features, the number of purchases is assumed to be on a monthly average basis.
For object-type categorical features such as education and marital_status, the possible values are listed in Table 3.
Table 3. Values under Categorical Features
| education | marital_status |
|---|---|
| Basic | Single |
| 2n Cycle | Together |
| Graduation | Married |
| Master | Divorced |
| PhD | Widow |
| Alone | |
| Absurd | |
| YOLO |
The dt_customer feature is currently set as an object data-type, displayed in Table 4.
Table 4. Snapshot of Dt_customer Feature
display(df[['dt_customer']].head())
| dt_customer | |
|---|---|
| 0 | 04-09-2012 |
| 1 | 08-03-2014 |
| 2 | 21-08-2013 |
| 3 | 10-02-2014 |
| 4 | 19-01-2014 |
The statistics of relevant integer and float-type features can be viewed per category type to understand their average values and expected variances.
Table 5. Statistics for Features under People Category
col_people = ['year_birth',
'income',
'kidhome',
'teenhome',
'recency',
'complain']
col_products = ['mntwines',
'mntfruits',
'mntmeatproducts',
'mntfishproducts',
'mntsweetproducts',
'mntgoldprods']
col_promotion = ['numdealspurchases',
'acceptedcmp1',
'acceptedcmp2',
'acceptedcmp3',
'acceptedcmp4',
'acceptedcmp5',
'response']
col_place = ['numwebpurchases',
'numcatalogpurchases',
'numstorepurchases',
'numwebvisitsmonth']
col_others = ['z_costcontact',
'z_revenue']
df.describe().loc[['mean', 'std'], col_people]
| year_birth | income | kidhome | teenhome | recency | complain | |
|---|---|---|---|---|---|---|
| mean | 1968.805804 | 52247.251354 | 0.444196 | 0.506250 | 49.109375 | 0.009375 |
| std | 11.984069 | 25173.076661 | 0.538398 | 0.544538 | 28.962453 | 0.096391 |
Based on the statistics displayed in Table 5., the average customer of the company has a household income of around 50,000 USD, likely has kids or teens at home, and is unlikely to complain.
Table 6. Statistics for Features under Products Category
df.describe().loc[['mean', 'std'], col_products]
| mntwines | mntfruits | mntmeatproducts | mntfishproducts | mntsweetproducts | mntgoldprods | |
|---|---|---|---|---|---|---|
| mean | 303.935714 | 26.302232 | 166.950000 | 37.525446 | 27.062946 | 44.021875 |
| std | 336.597393 | 39.773434 | 225.715373 | 54.628979 | 41.280498 | 52.167439 |
From Table 6., it can be concluded that the average customer spends the most on wine at around 300 USD, followed by meat products at around 170 USD.
Table 7. Statistics for Features under Promotion Category
df.describe().loc[['mean', 'std'], col_promotion]
| numdealspurchases | acceptedcmp1 | acceptedcmp2 | acceptedcmp3 | acceptedcmp4 | acceptedcmp5 | response | |
|---|---|---|---|---|---|---|---|
| mean | 2.325000 | 0.064286 | 0.013393 | 0.072768 | 0.074554 | 0.072768 | 0.149107 |
| std | 1.932238 | 0.245316 | 0.114976 | 0.259813 | 0.262728 | 0.259813 | 0.356274 |
Based on Table 7., customers have, on average, made about two purchases with a discount. Around 6-7% have accepted campaign offers, with the exception of Campaign 2, the least successful, which captured only around 1% of the customer base. The most recent campaign was the most successful, with an acceptance rate of around 15%.
Table 8. Statistics for Features under Place Category
df.describe().loc[['mean', 'std'], col_place]
| numwebpurchases | numcatalogpurchases | numstorepurchases | numwebvisitsmonth | |
|---|---|---|---|---|
| mean | 4.084821 | 2.662054 | 5.790179 | 5.316518 |
| std | 2.778714 | 2.923101 | 3.250958 | 2.426645 |
Table 8. shows that, on average, customers are most likely to purchase products from the store, followed by the web, and are least likely to purchase through a catalog. Additionally, customers visit the company website around 5 times a month on average.
Table 9. Statistics for Features under Others Category
df.describe().loc[['mean', 'std'], col_others]
| z_costcontact | z_revenue | |
|---|---|---|
| mean | 3.0 | 11.0 |
| std | 0.0 | 0.0 |
The z_costcontact and z_revenue features are deemed to be unnecessary features as they are unknown variables with constant values throughout the whole dataset, as shown in Table 9.
Using the statistics above, the average customer of the company can be inferred to have the following characteristics.
Baseline Average Customer:
They are likely born between 1963 and 1973, with an average annual income of around 50,000 USD, and likely have children at home. They spend the most on wine, followed by meat products. They have a 6-7% chance of accepting campaign offers, and are most likely to purchase from the store.
Methodology Overview
Figure 1. Methodology Overview
Captured in Figure 1. is a high-level overview of the methodology pipeline of the study. Table 10. below details each step of the pipeline conducted to address the problem of determining the company's customer segments.
Table 10. Methodology Details
| Step | Process | Description |
|---|---|---|
| 1 | Data Cleaning and Preprocessing | Prepare the dataset by handling missing values, mapping ordinal values, one-hot encoding nominal features, and excluding unnecessary features |
| 2 | Dimensionality Reduction | Standardize the data and perform dimensionality reduction via PCA |
| 3 | Hierarchical-based Clustering | Plot the dendrogram to determine threshold and perform clustering via Ward's method to determine the clusters |
| 4 | Results and Discussion | Analyze the results of the clusters and sub-clusters to produce insights on each customer segment relating to their characteristics and preferences |
| 5 | Conclusion | Summarize the insights to create customer profiles, and suggest a marketing strategy per segment based on the results |
| 6 | Recommendation | Provide recommendations for future studies on possible further improvements that could be made given the limitations of the current project |
Data Cleaning and Preprocessing
Handling of Missing Values
The first step of data preparation is the identification of any missing or null values in the dataset. Missing values can indicate issues with certain data entries and may introduce noise that leads to inaccurate clustering and misinterpretation of the results.
Table 11. Null Value Count per Feature
df.isnull().sum()
id                     0
year_birth             0
education              0
marital_status         0
income                24
kidhome                0
teenhome               0
dt_customer            0
recency                0
mntwines               0
mntfruits              0
mntmeatproducts        0
mntfishproducts        0
mntsweetproducts       0
mntgoldprods           0
numdealspurchases      0
numwebpurchases        0
numcatalogpurchases    0
numstorepurchases      0
numwebvisitsmonth      0
acceptedcmp3           0
acceptedcmp4           0
acceptedcmp5           0
acceptedcmp1           0
acceptedcmp2           0
complain               0
z_costcontact          0
z_revenue              0
response               0
dtype: int64
As observed in Table 11., there are 24 customers out of the total 2240 that have missing or blank entries for their income. To handle these missing values, the decision was to remove these entries altogether rather than to impute their values, for the following reasons:
- Because the purpose of the study is to conduct clustering, imputation may negatively impact the clustering and produce inaccurate results, given that it makes assumptions about an entry's income feature.
- Removal of the entries with the missing feature will have minimal impact on the dataset, as they represent only around 1% of the customer base.
df.dropna(inplace=True)
print(f'Missing Values Remaining: {df.isnull().sum().sum()}')
Missing Values Remaining: 0
Mapping of Ordinal Features
The next step is to handle categorical features in order to transform them into a data-type suitable for clustering. For ordinal features, a mapping can be used to assign an integer based on the feature's inherent ranking. The customer's education is selected as an ordinal feature, with the ranking specified in Table 12.
Table 12. Ordinal Mapping for Education Feature
| Original Value | Mapped Value |
|---|---|
| Basic | 0 |
| Graduation | 1 |
| 2n Cycle | 2 |
| Master | 3 |
| PhD | 4 |
mapping_education = {
'Basic': 0,
'Graduation': 1,
'2n Cycle': 2,
'Master': 3,
'PhD': 4
}
df['education'] = df['education'].map(mapping_education)
One-hot Encoding of Nominal Features
Categorical features without an inherent ranking can be handled by one-hot encoding them to transform these categories to binary data. The marital_status feature is one-hot encoded due to the lack of inherent ranking, and the existence of vague entries such as YOLO and Together. The results are reflected in Table 13.
Table 13. Snapshot of Resulting Columns from One-hot Encoding of Marital Status
col_nominal = ['marital_status']
for col in col_nominal:
df_dummy = pd.get_dummies(
df[col], prefix=col, drop_first=True).astype(int)
df = pd.concat([df, df_dummy], axis=1)
df = df.drop(col_nominal, axis=1)
df.columns = df.columns.str.lower()
display(df[[col for col in df.columns if 'marital_status' in col]].head())
| marital_status_alone | marital_status_divorced | marital_status_married | marital_status_single | marital_status_together | marital_status_widow | marital_status_yolo | |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
Representation of Object-type Date as Integer
As seen earlier in the data exploration stage, the dt_customer feature is currently stored as an object data-type and must be transformed to numerical data to make it compatible with clustering.
The feature, which represents the date of when the customer initially enrolled with the company, can be transformed to a datetime component. This can then be further transformed to an integer representation based on the duration of the customer's enrollment, tagged with the new feature enrollment_duration. The duration can be calculated based on the most recent enrollment date acting as the 'as of' date, with the resulting computation displayed in Table 14.
Table 14. Snapshot of New Enrollment Duration Feature
df['dt_customer'] = pd.to_datetime(df['dt_customer'], format='%d-%m-%Y')
latest_enrollment_date = df['dt_customer'].max()
df['enrollment_duration'] = (latest_enrollment_date - df['dt_customer']).dt.days
display(df[['dt_customer', 'enrollment_duration']].head())
df = df.drop('dt_customer', axis=1)
| dt_customer | enrollment_duration | |
|---|---|---|
| 0 | 2012-09-04 | 663 |
| 1 | 2014-03-08 | 113 |
| 2 | 2013-08-21 | 312 |
| 3 | 2014-02-10 | 139 |
| 4 | 2014-01-19 | 161 |
Feature Exclusion
Non-informative features should be considered for removal or exclusion from clustering.
Features z_costcontact and z_revenue are considered for exclusion because they contain only constant values throughout the entire dataset, and there is no other information about what they represent from the online source (Patel, 2021).
The id feature should also be removed, as it only serves as a unique identifier for each customer. Since it comes as an integer data-type, it would introduce numerical values that carry no informative meaning for the clustering.
col_drop = ['z_costcontact', 'z_revenue', 'id']
df = df.drop(col_drop, axis=1)
The result of the data preparation is found in Table 15., where all features are represented numerically, there are no missing values, and all unnecessary features are excluded. Our final preprocessed dataset contains 2216 customer entries and 32 features.
Table 15. Features of Preprocessed Customer Personality Analysis Data
df = df.astype(int)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2216 entries, 0 to 2239
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   year_birth               2216 non-null   int64
 1   education                2216 non-null   int64
 2   income                   2216 non-null   int64
 3   kidhome                  2216 non-null   int64
 4   teenhome                 2216 non-null   int64
 5   recency                  2216 non-null   int64
 6   mntwines                 2216 non-null   int64
 7   mntfruits                2216 non-null   int64
 8   mntmeatproducts          2216 non-null   int64
 9   mntfishproducts          2216 non-null   int64
 10  mntsweetproducts         2216 non-null   int64
 11  mntgoldprods             2216 non-null   int64
 12  numdealspurchases        2216 non-null   int64
 13  numwebpurchases          2216 non-null   int64
 14  numcatalogpurchases      2216 non-null   int64
 15  numstorepurchases        2216 non-null   int64
 16  numwebvisitsmonth        2216 non-null   int64
 17  acceptedcmp3             2216 non-null   int64
 18  acceptedcmp4             2216 non-null   int64
 19  acceptedcmp5             2216 non-null   int64
 20  acceptedcmp1             2216 non-null   int64
 21  acceptedcmp2             2216 non-null   int64
 22  complain                 2216 non-null   int64
 23  response                 2216 non-null   int64
 24  marital_status_alone     2216 non-null   int64
 25  marital_status_divorced  2216 non-null   int64
 26  marital_status_married   2216 non-null   int64
 27  marital_status_single    2216 non-null   int64
 28  marital_status_together  2216 non-null   int64
 29  marital_status_widow     2216 non-null   int64
 30  marital_status_yolo      2216 non-null   int64
 31  enrollment_duration      2216 non-null   int64
dtypes: int64(32)
memory usage: 571.3 KB
Dimensionality Reduction
Dimensionality reduction will be performed due to the following reasons:
- High-dimensional data can suffer from the curse of dimensionality as the data becomes more sparse. This can cause challenges during clustering, where results become less meaningful.
- By reducing the number of dimensions, the computational complexity of the dataset can be reduced to improve computational efficiency.
- Irrelevant or noise features that do not offer significant information gain can be reduced, potentially improving clustering results.
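The sparsity concern can be illustrated on synthetic data (not the study's dataset): as the number of dimensions grows, pairwise distances concentrate around a common value, which weakens any distance-based clustering. A minimal sketch, purely illustrative:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Illustration on synthetic uniform data: the relative spread of pairwise
# Euclidean distances (std / mean) shrinks as dimensionality grows,
# which is the sparsity problem dimensionality reduction helps mitigate.
rng = np.random.default_rng(42)
ratios = {}
for d in (2, 1000):
    X = rng.random((200, d))   # 200 random points in d dimensions
    dists = pdist(X)           # all pairwise distances
    ratios[d] = dists.std() / dists.mean()
print(ratios)  # the ratio for d=1000 is far smaller than for d=2
```

In 2 dimensions the distances vary widely relative to their mean; in 1000 dimensions nearly all pairs sit at almost the same distance, so "near" and "far" lose meaning.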
Principal Component Analysis (PCA) is the chosen method for its simplicity and efficiency. The dataset is small and non-sparse, making it manageable when performing PCA.
Data Standardization
The data is standardized in preparation for PCA. Standardizing the features ensures that each contributes evenly to the computation of the principal components and prevents features with larger magnitudes from dominating the calculation. The results of the standardization can be found in Table 16.
Table 16. Snapshot of Scaled Features
standard_scaler = StandardScaler()
df_scaled = standard_scaler.fit_transform(df.values)
display(pd.DataFrame(df_scaled).head())
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.986443 | -0.819198 | 0.234063 | -0.823039 | -0.928972 | 0.310532 | 0.978226 | 1.549429 | 1.690227 | 2.454568 | ... | -0.097812 | 2.377952 | -0.036819 | -0.341958 | -0.794110 | 1.924807 | -0.590553 | -0.188452 | -0.030056 | 1.529129 |
| 1 | -1.236801 | -0.819198 | -0.234559 | 1.039938 | 0.909066 | -0.380509 | -0.872024 | -0.637328 | -0.717986 | -0.651038 | ... | -0.097812 | -0.420530 | -0.036819 | -0.341958 | -0.794110 | 1.924807 | -0.590553 | -0.188452 | -0.030056 | -1.188411 |
| 2 | -0.318822 | -0.819198 | 0.769478 | -0.823039 | -0.928972 | -0.795134 | 0.358511 | 0.569159 | -0.178368 | 1.340203 | ... | -0.097812 | -0.420530 | -0.036819 | -0.341958 | -0.794110 | -0.519533 | 1.693329 | -0.188452 | -0.030056 | -0.205155 |
| 3 | 1.266777 | -0.819198 | -1.017239 | 1.039938 | -0.928972 | -0.795134 | -0.872024 | -0.561922 | -0.655551 | -0.504892 | ... | -0.097812 | -0.420530 | -0.036819 | -0.341958 | -0.794110 | -0.519533 | 1.693329 | -0.188452 | -0.030056 | -1.059945 |
| 4 | 1.016420 | 1.529240 | 0.240221 | 1.039938 | -0.928972 | 1.554407 | -0.391671 | 0.418348 | -0.218505 | 0.152766 | ... | -0.097812 | -0.420530 | -0.036819 | -0.341958 | 1.259271 | -0.519533 | -0.590553 | -0.188452 | -0.030056 | -0.951244 |
5 rows × 32 columns
PCA
To set up the PCA, a fixed random_state was selected, and the PCA was fit on the scaled data. The cumulative explained variance is used to determine how many principal components to consider.
pca = PCA(random_state=42)
df_new = pca.fit_transform(df_scaled)
variance_explained = pca.explained_variance_ratio_
cumulative_variance_explained = variance_explained.cumsum()
fig, ax = plt.subplots()
ax.plot(range(1, len(variance_explained) + 1),
variance_explained,
'-',
label='individual')
ax.set_xlim(0, len(variance_explained)+1)
ax.set_xlabel('Number of PCAs')
ax.set_ylabel('Variance explained')
ax = ax.twinx()
ax.plot(range(1, len(variance_explained) + 1),
cumulative_variance_explained,
'r-',
label='cumulative')
ax.set_ylabel('Cumulative variance explained')
ax.axhline(0.81, ls='--', color='g')
ax.axvline(17, ls='--', color='g')
ax.set_title('Figure 2. Variance and Cumulative Variance Explained across PCAs');
From the results in Figure 2., we reach a cumulative explained variance of 0.8 when using 17 principal components. This will be the number of components used when performing dimensionality reduction. Displayed in Table 17. are the 17 retained principal components to be used during clustering.
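The component count need not be read off the plot by eye; it can also be computed directly from the explained variance ratios. A small helper sketch (the function name is illustrative, not part of the study's code):

```python
import numpy as np

# Hypothetical helper: smallest number of principal components whose
# cumulative explained variance reaches a target threshold.
def n_components_for_threshold(variance_ratios, threshold=0.8):
    cumulative = np.cumsum(variance_ratios)
    # index of the first cumulative value >= threshold, converted to a count
    return int(np.searchsorted(cumulative, threshold)) + 1

# Toy ratios: the cumulative sums are 0.5, 0.7, 0.85, ..., so three
# components are needed to reach the 0.8 threshold here.
print(n_components_for_threshold([0.5, 0.2, 0.15, 0.1, 0.05]))  # 3
```

Applied to pca.explained_variance_ratio_ from the fit above, this should reproduce a choice close to the 17 components selected here.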
Table 17. Snapshot of Scaled and Reduced Features
pca = PCA(n_components=17, random_state=42)
df_pca_reduced = pca.fit_transform(df_scaled)
display(pd.DataFrame(df_pca_reduced).head())
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.830161 | -0.181321 | 0.167833 | 3.219346 | -1.252481 | -1.094959 | -0.583047 | 0.165437 | 0.750438 | -0.013983 | -0.179083 | -0.276510 | 0.263946 | -0.477823 | 0.196159 | -1.440021 | 1.344352 |
| 1 | -2.435863 | -0.638125 | -0.241413 | -0.860753 | -0.924621 | -1.612613 | -0.846041 | 0.219046 | 0.197928 | -0.640864 | 0.008982 | 0.472702 | -1.303607 | 0.426551 | -0.993351 | -0.729119 | 0.924650 |
| 2 | 1.600887 | -0.185814 | -1.257625 | -0.243510 | -1.254225 | 1.250047 | 0.419811 | 0.540887 | -0.018901 | 0.621133 | 0.143211 | 0.648513 | -0.886527 | -0.250879 | 0.712234 | 0.063698 | -1.008799 |
| 3 | -2.569622 | -1.534893 | 0.002154 | -0.245513 | -1.094061 | 1.608984 | 0.435226 | 0.497322 | -0.152752 | 0.552256 | 0.382589 | 0.483710 | -0.574400 | -0.288417 | -0.387440 | 0.845467 | -0.584655 |
| 4 | -0.502419 | -0.151635 | -0.586862 | -0.015381 | 1.647314 | 0.304772 | -0.570991 | 0.467957 | -0.900724 | -0.458106 | 0.217740 | -0.945133 | 1.616572 | -0.205024 | -0.852598 | 1.661447 | 0.257108 |
Plotting the datapoints on the first three principal components, the dataset can be visualized in Figure 3.
plot_3d(df_pca_reduced, y_predict=None)
Figure 3. 3D Plot of Dataset
Hierarchical-based Clustering
Ward's Method
Hierarchical-based Ward's method is the chosen clustering method due to the following reasons:
- The plot of the dendrogram can be used to visually determine the number of clusters generated, and allows for the opportunity to identify sub-clusters beneath larger, main clusters. This is particularly useful for the study as it allows for flexibility on customer segmentation, allowing the company to further segment large customer sections to smaller groups when needed.
- In comparison to other linkage methods, Ward's method is less sensitive to the shape of the cluster and tends to form compact, spherical clusters.
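As a sanity check of the linkage-plus-threshold workflow, a toy example on synthetic data (not the study's dataset) shows how cutting a Ward linkage matrix at a distance threshold yields flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated synthetic blobs; a distance cut below the dendrogram's
# top merge should recover exactly two flat clusters.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(10, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(10, 2))
X = np.vstack([blob_a, blob_b])

Z = linkage(X, method='ward')                     # hierarchical merges
labels = fcluster(Z, t=2.0, criterion='distance')  # cut below the top merge
print(sorted(set(labels)))  # two distinct cluster labels
```

The same criterion='distance' cut is what the study applies to the real linkage matrix, with the threshold chosen from the dendrogram's largest branch gap.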
To perform the clustering, the dendrogram is first plotted to determine the epsilon threshold, which will in turn determine the total number of clusters and their corresponding datapoints.
Z = linkage(df_pca_reduced, method='ward', optimal_ordering=True)
plot_dendrogram(Z);
From the results of the dendrogram in Figure 4., two clusters can be formed when selecting an epsilon threshold between 90 and 120, where the largest branch gap can be found.
There is also potential for segmentation into sub-clusters at around the epsilon=80 and epsilon=70 thresholds, yielding three and four clusters respectively.
y_predict_ward_two = fcluster(Z, t=120, criterion='distance')
plot_3d(df_pca_reduced, y_predict_ward_two)
Figure 5. 3D Plot for Two Clusters
The two main clusters can be plotted against the first three principal components, as shown in Figure 5. From visual inspection, the clustering shows some of the characteristics that commonly define good clustering:
- Relatively compact for points within the cluster
- Relatively separated from points outside the cluster
- Parsimonious, consisting of only two clusters
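Compactness and separation can also be loosely quantified with the silhouette score, which approaches +1 for compact, well-separated clusters. This check is not part of the original analysis; the sketch below uses synthetic blobs for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(60, 3)),  # compact blob 1
    rng.normal(4.0, 0.5, size=(60, 3)),  # compact blob 2
])
Z = linkage(X, method='ward')
labels = fcluster(Z, t=10, criterion='distance')

# Silhouette near +1: points are much closer to their own cluster
# than to the nearest other cluster
score = silhouette_score(X, labels)
print(round(score, 2))  # close to 1 for these well-separated blobs
```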
The same process is conducted for the three- and four-cluster solutions, where the count of datapoints per cluster is shown in Table 18. and Table 19. respectively.
Table 18. Customer Count for Three Clusters
y_predict_ward_three = fcluster(Z, t=80, criterion='distance')
display(pd.DataFrame.from_dict(Counter(y_predict_ward_three), orient='index', columns=['count']))
| | count |
|---|---|
| 1 | 541 |
| 2 | 811 |
| 3 | 864 |
Table 19. Customer Count for Four Clusters
y_predict_ward_four = fcluster(Z, t=70, criterion='distance')
display(pd.DataFrame.from_dict(Counter(y_predict_ward_four), orient='index', columns=['count']).sort_index())
| | count |
|---|---|
| 1 | 30 |
| 2 | 511 |
| 3 | 811 |
| 4 | 864 |
While four clusters are feasible, the number of sub-clusters is set to three instead. This is because one of the four clusters contains only 30 customers, a segment that may be too niche to be particularly useful for the marketing and sales strategy.
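The size check behind this decision can be automated by cutting the tree at each candidate threshold and counting members per cluster. The sketch below uses synthetic data, where a tiny 5-point blob plays the role of the 30-customer niche segment:

```python
import numpy as np
from collections import Counter
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0, 0.3, size=(100, 2)),
    rng.normal(5, 0.3, size=(80, 2)),
    rng.normal(10, 0.3, size=(5, 2)),  # tiny niche group
])
Z = linkage(X, method='ward')

def cluster_sizes(Z, t):
    """Cut the dendrogram at distance t and return sorted cluster sizes."""
    return sorted(Counter(fcluster(Z, t=t, criterion='distance')).values())

# A low threshold exposes the tiny segment; a higher one merges it away,
# mirroring the choice of three clusters over four
print(cluster_sizes(Z, t=5))   # [5, 80, 100]
print(cluster_sizes(Z, t=30))  # [85, 100]
```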
The three clusters are visualized in Figure 6. against the first three principal components. Similar to the insights from the two clusters, the three segments also show the desired characteristics of good clustering from visual inspection.
plot_3d(df_pca_reduced, y_predict_ward_three)
Figure 6. 3D Plot for Three Clusters
The resulting cluster and sub-clusters can be projected back to the features of the original dataframe in order to make inferences on the distinct characteristics of each cluster type and create customer profiles. Table 20. displays each customer, their features and what cluster and sub-clusters they are tagged under.
Table 20. Customer Features and Corresponding Cluster and Sub-clusters
df_cluster = df.copy()
df_cluster['cluster'] = y_predict_ward_two
df_subcluster = df_cluster.copy()
df_subcluster['subcluster'] = y_predict_ward_three
display(df_subcluster.head(10))
| | year_birth | education | income | kidhome | teenhome | recency | mntwines | mntfruits | mntmeatproducts | mntfishproducts | ... | marital_status_alone | marital_status_divorced | marital_status_married | marital_status_single | marital_status_together | marital_status_widow | marital_status_yolo | enrollment_duration | cluster | subcluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1957 | 1 | 58138 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 663 | 1 | 1 |
| 1 | 1954 | 1 | 46344 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 113 | 2 | 2 |
| 2 | 1965 | 1 | 71613 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 312 | 1 | 1 |
| 3 | 1984 | 1 | 26646 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 139 | 2 | 2 |
| 4 | 1981 | 4 | 58293 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 161 | 1 | 1 |
| 5 | 1967 | 3 | 62513 | 0 | 1 | 16 | 520 | 42 | 98 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 293 | 2 | 3 |
| 6 | 1971 | 1 | 55635 | 0 | 1 | 34 | 235 | 65 | 164 | 50 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 593 | 2 | 3 |
| 7 | 1985 | 4 | 33454 | 1 | 0 | 32 | 76 | 10 | 56 | 3 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 417 | 2 | 3 |
| 8 | 1974 | 4 | 30351 | 1 | 0 | 19 | 14 | 0 | 24 | 3 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 388 | 2 | 2 |
| 9 | 1950 | 4 | 5648 | 1 | 1 | 68 | 28 | 0 | 6 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 108 | 2 | 3 |
10 rows × 34 columns
Results and Discussion
To form customer profiles for each customer segment, the average characteristics per cluster type were analyzed by feature category: people, marital status, products, promotion, and place.
col_people = ['year_birth', 'education', 'income', 'kidhome', 'teenhome', 'recency', 'complain', 'enrollment_duration']
col_marital_status = ['marital_status_alone', 'marital_status_divorced', 'marital_status_married', 'marital_status_single', 'marital_status_together', 'marital_status_widow', 'marital_status_yolo']
col_products = ['mntwines', 'mntfruits', 'mntmeatproducts', 'mntfishproducts', 'mntsweetproducts', 'mntgoldprods']
col_promotion = ['numdealspurchases', 'acceptedcmp1', 'acceptedcmp2', 'acceptedcmp3', 'acceptedcmp4', 'acceptedcmp5', 'response']
col_place = ['numwebpurchases', 'numcatalogpurchases', 'numstorepurchases', 'numwebvisitsmonth']
People Category
This category shows the insights for the unique identifiers that define a customer, such as their age and income, displayed in Table 21.
Table 21. People Category Averages per Cluster
df_subcluster.groupby(['cluster', 'subcluster']).agg('mean').loc[:, col_people]
| cluster | subcluster | year_birth | education | income | kidhome | teenhome | recency | complain | enrollment_duration |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1969.523105 | 1.898336 | 75463.029575 | 0.057301 | 0.231054 | 49.580407 | 0.000000 | 353.693161 |
| 2 | 2 | 1971.678175 | 1.887793 | 35222.504316 | 0.762022 | 0.442663 | 49.236745 | 0.000000 | 314.610358 |
| 2 | 3 | 1965.697917 | 2.288194 | 53690.924769 | 0.381944 | 0.736111 | 48.446759 | 0.024306 | 389.937500 |
- Subcluster 1: Of average age, with lower educational attainment, belonging to the high-income bracket, and likely with no children.
- Subcluster 2: Younger, with lower educational attainment, belonging to the lower-income bracket, and likely with young children at home.
- Subcluster 3: Older, with higher educational attainment, belonging to the mid-income bracket, and likely with older children at home.
Marketing to Subcluster 1 could focus on premium products and services due to their high income. Subcluster 2, with younger children and lower income, should be more receptive to budget-friendly offerings and family deals. Subcluster 3, being older and more educated, may appreciate more detailed information and a higher level of customer service.
Marital Status Category
This category shows the insights on the marital status of each segment, illustrated in Table 22.
Table 22. Marital Status Category Averages per Cluster
df_subcluster.groupby(['cluster', 'subcluster']).agg('mean').loc[:, col_marital_status]
| cluster | subcluster | marital_status_alone | marital_status_divorced | marital_status_married | marital_status_single | marital_status_together | marital_status_widow | marital_status_yolo |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0.000000 | 0.016636 | 0.434381 | 0.251386 | 0.292052 | 0.001848 | 0.000000 |
| 2 | 2 | 0.000000 | 0.000000 | 0.440197 | 0.271270 | 0.288533 | 0.000000 | 0.000000 |
| 2 | 3 | 0.003472 | 0.258102 | 0.306713 | 0.133102 | 0.209491 | 0.086806 | 0.002315 |
- Subcluster 1: Likely married or with a partner, some single, and a few divorced.
- Subcluster 2: Likely married or with a partner, and some single.
- Subcluster 3: May be married or with a partner, many divorced, and a few widowed.
Subcluster 1 primarily consists of individuals who are married or in a partnership, with a significant portion being single and a smaller fraction divorced. This diversity suggests varying needs and preferences within the cluster. In Subcluster 2, the majority are either married or with a partner, with a notable number of singles. The absence of divorced or widowed individuals might indicate a more homogeneous group in terms of life experiences. Subcluster 3 displays a more varied marital status distribution: apart from those married or with partners, there is a substantial proportion of divorced and a noticeable number of widowed individuals, indicating a possibly older demographic with diverse life experiences.
Married or partnered individuals in Subclusters 1 and 2 are likely to respond to promotions targeting families or couples. In contrast, the diverse marital statuses in Subcluster 3 suggest the need for a more varied marketing approach.
Products Category
This category focuses on the purchasing behavior of different customer segments, specifically looking at their spending in various product categories, shown in Table 23.
Table 23. Products Category Averages per Cluster
df_subcluster.groupby(['cluster', 'subcluster']).agg('mean').loc[:, col_products]
| cluster | subcluster | mntwines | mntfruits | mntmeatproducts | mntfishproducts | mntsweetproducts | mntgoldprods |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 573.316081 | 66.340111 | 428.203327 | 95.051756 | 67.338262 | 77.646950 |
| 2 | 2 | 52.335388 | 5.727497 | 26.855734 | 8.914920 | 5.908755 | 16.168927 |
| 2 | 3 | 374.392361 | 20.682870 | 134.982639 | 28.648148 | 21.613426 | 48.966435 |
- Subcluster 1: Even amounts of wine and meat products bought
- Subcluster 2: Fewest products bought
- Subcluster 3: Relatively high amount of wine bought, followed by meat products
Subcluster 1 shows a balanced spending pattern with significant expenditure on wines and meat products, indicating a preference for these categories and possibly a higher disposable income. Customers in Subcluster 2 exhibit the lowest spending across all categories, which could reflect a lower purchasing power or a different set of priorities and preferences. Subcluster 3 has a relatively high expenditure on wines, followed by meat products. Its spending pattern falls between Subclusters 1 and 2, suggesting a moderate level of disposable income and a preference for certain luxury or high-quality products.
Marketing efforts directed at Subcluster 2 need to focus more on value-for-money products and promotions that highlight affordability; this group could be more responsive to discounts and bundle deals. The popularity of wines in Subclusters 1 and 3 suggests a potential market for exclusive wine-related products or events. There is also potential for cross-selling and upselling based on these spending patterns: for instance, customers in Subcluster 1 who spend heavily on wines and meats might be interested in gourmet food pairings or luxury kitchenware.
df_subcluster['amt_spent'] = (
df_subcluster['mntwines'] +
df_subcluster['mntfruits'] +
df_subcluster['mntmeatproducts'] +
df_subcluster['mntfishproducts'] +
df_subcluster['mntsweetproducts'] +
df_subcluster['mntgoldprods']
)
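Equivalently, the row-wise total can be computed with a vectorized sum over a product-column list, which is easier to keep in sync when columns change. A small sketch on a toy frame (values are illustrative, not the study's data):

```python
import pandas as pd

col_products = ['mntwines', 'mntfruits', 'mntmeatproducts',
                'mntfishproducts', 'mntsweetproducts', 'mntgoldprods']

# Toy frame with the same product columns as the study's dataset
toy = pd.DataFrame({
    'mntwines': [635, 11],
    'mntfruits': [88, 1],
    'mntmeatproducts': [546, 6],
    'mntfishproducts': [172, 2],
    'mntsweetproducts': [88, 1],
    'mntgoldprods': [88, 16],
})

# Row-wise total spend across all product categories
toy['amt_spent'] = toy[col_products].sum(axis=1)
print(toy['amt_spent'].tolist())  # [1617, 37]
```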
scatter = plt.scatter(
x=df_subcluster['amt_spent'],
y=df_subcluster['income'],
c=df_subcluster['subcluster'],
alpha=0.8)
plt.title("Figure 7. Income and Spending per Subcluster")
plt.xlabel('Amount Spent on Products')
plt.ylabel('Income')
plt.xlim(left=0, right=2600)
plt.ylim(bottom=0, top=200000)
plt.grid(True, linestyle='--', alpha=0.5)
handles, labels = scatter.legend_elements()
plt.legend(handles, labels, title='Subclusters')
plt.tight_layout()
plt.show()
Plotting income against amount spent in Figure 7., Subcluster 1's spending falls roughly in the range of 1,000 to 2,500 with incomes above 50,000, while Subclusters 2 and 3 are scattered across spending of roughly 0 to 2,500 with incomes ranging from roughly 0 to 75,000.
Promotions Category
This category focuses on how different customer segments respond to marketing campaigns and their propensity to make purchases through deals, displayed in Table 24.
Table 24. Promotions Category Averages per Cluster
df_subcluster.groupby(['cluster', 'subcluster']).agg('mean').loc[:, col_promotion]
| cluster | subcluster | numdealspurchases | acceptedcmp1 | acceptedcmp2 | acceptedcmp3 | acceptedcmp4 | acceptedcmp5 | response |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1.510166 | 0.205176 | 0.055453 | 0.068392 | 0.103512 | 0.231054 | 0.240296 |
| 2 | 2 | 2.028360 | 0.002466 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.062885 |
| 2 | 3 | 3.109954 | 0.033565 | 0.000000 | 0.145833 | 0.125000 | 0.042824 | 0.175926 |
- Subcluster 1: Least likely to purchase with a discount; frequently accepts campaigns; high success for Campaigns 1, 5, and the most recent
- Subcluster 2: Sometimes purchases with a discount; almost never accepts campaigns; little success, only with Campaign 1
- Subcluster 3: Frequently purchases with a discount; sometimes accepts campaigns; relative success for Campaigns 3, 4, and the most recent
Subcluster 1 is characterized by a low frequency of purchasing with discounts but a higher likelihood of responding to marketing campaigns, particularly Campaigns 1, 5, and the most recent campaign. This suggests a segment that is less price-sensitive but more responsive to targeted marketing efforts. Customers in Subcluster 2 occasionally make purchases with discounts but have a very low rate of responding to marketing campaigns, indicating a segment that is somewhat price-conscious but generally indifferent to marketing initiatives. Subcluster 3 frequently purchases with discounts and shows some responsiveness to marketing campaigns, particularly Campaigns 3, 4, and the recent campaign, suggesting a segment that is both price-sensitive and somewhat receptive to marketing. Overall, Subcluster 1 is more receptive to exclusive, non-discounted offers, while Subcluster 3 responds better to promotions that offer clear value or discounts.
In terms of price sensitivity, Subcluster 2's tendency to sometimes purchase with discounts, combined with its low campaign acceptance rate, could indicate a segment that is opportunistic in its purchasing behavior, seeking deals without actively engaging with marketing efforts. This suggests a need for more compelling value propositions or alternative engagement strategies.
Place Category
This category provides insights into how each customer subcluster prefers to make purchases (web, catalog, or in-store) and their engagement with the company website, illustrated in Table 25.
Table 25. Place Category Averages per Cluster
df_subcluster.groupby(['cluster', 'subcluster']).agg('mean').loc[:, col_place]
| cluster | subcluster | numwebpurchases | numcatalogpurchases | numstorepurchases | numwebvisitsmonth |
|---|---|---|---|---|---|
| 1 | 1 | 5.162662 | 5.659889 | 8.434381 | 3.020333 |
| 2 | 2 | 2.282367 | 0.600493 | 3.453761 | 6.353884 |
| 2 | 3 | 5.103009 | 2.743056 | 6.355324 | 5.787037 |
- Subcluster 1: Likely to purchase in-store, followed by catalog or web; least likely to visit the company website
- Subcluster 2: Likely to purchase in-store or on the web; infrequently purchases from the catalog; frequently visits the company website
- Subcluster 3: Likely to purchase in-store or on the web, followed by catalog; sometimes visits the company website
Subcluster 1 shows a strong preference for in-store purchases, followed by catalog and web purchases, and is the least likely to visit the company's website. This suggests a segment that values the physical shopping experience and possibly personal interaction. Customers in Subcluster 2 are inclined to make purchases both in-store and through the web but infrequently purchase through catalogs; they also frequently visit the company website, indicating a higher level of online engagement. Subcluster 3 prefers to purchase from stores or through the web, followed by catalog purchases, and visits the website at a moderate frequency, suggesting a balanced approach to both online and offline shopping.
For Subcluster 1, enhancing the in-store experience and catalog design could lead to increased customer satisfaction and sales. For Subclusters 2 and 3, improving the online shopping experience and website usability might be more impactful. The high frequency of web visits by Subcluster 2 offers an opportunity to gather more data on customer preferences and behaviors through their online interactions, which can be used to further personalize offers and recommendations.
Conclusion
Using the hierarchical-based Ward's method of clustering, the study identified distinct segments within the data and the key characteristics that differentiate them, displayed in Figure 8.
Figure 8. The Customer Segments
Cluster 1: The Affluent Traditionalists
Subcluster 1 The Premier Traditionalists: These individuals are characterized by a preference for premium goods, responsiveness to marketing campaigns, and a tendency toward in-store shopping. This subcluster represents the premium and engaged segment, underscored by their significant expenditure on high-end categories like wines and meats.
Cluster 2: The Digital Economizers
Subcluster 2 The Budget-Focused Families: These individuals are marked by budget-consciousness, digital engagement, and a lower response to marketing campaigns. This subcluster embodies the value-conscious, digitally engaged segment: their frequent web visits and balanced online and offline purchasing behavior suggest comfort with digital platforms, coupled with a focus on economical choices.
Subcluster 3 The Flexible Consumers: These individuals are characterized by moderate spending and a balanced approach to shopping and campaign responsiveness. This subcluster captures the moderate and diverse shopper segment, highlighted by their versatility in adapting to different shopping modes and their responsiveness to various marketing campaigns.
Clustering enables the company to tailor its marketing, sales, and operational strategies to each unique segment. Capturing each segment's preferences and behaviors allows for the development of targeted marketing and sales efforts, ensuring the optimization of budget and resources. This targeted approach not only enhances customer experience but also bolsters long-term retention by aligning the company's offerings and communication with the specific needs and preferences of each segment. Lastly, clustering allows the company to allocate resources more efficiently by focusing efforts on the most profitable or responsive customer segments.
The main clusters provide a strategic roadmap for tailoring marketing and engagement efforts. For Main Cluster 1, the focus should be on premium offerings, personalized services, and enhancing the in-store experience. For Main Cluster 2, strategies should revolve around digital engagement, value-based promotions, and accommodating the needs of budget-conscious families.
Recommendations
The application of hierarchical clustering with Ward's method has yielded insightful customer segmentations. To further enhance the effectiveness of this approach, it is recommended that future research build on these findings, address the limitations outlined below, and integrate advanced methodologies. This continued effort will be instrumental in refining the segmentation strategy and maintaining the company's relevance in a dynamic market landscape.
Limitations
- The current analysis is based on a specific dataset, which may not capture the full spectrum of the customer base. There is a potential limitation in the diversity of the data, possibly overlooking emerging customer segments or underrepresented demographics. An example of this is the nature of the previous marketing campaigns used by the company. Marketing campaigns are targeted to certain segments and are often changing.
- The study analyzes customer behavior at a specific point in time. Consumer preferences and market dynamics are constantly evolving; hence, the findings might not fully encapsulate future trends or shifts in consumer behavior.
- The data lacks qualitative insights such as customer motivations, preferences, and perceptions. These aspects are critical in understanding the deeper reasons behind purchasing decisions and customer loyalty.
- The analysis primarily focuses on descriptive and inferential statistics. Incorporating predictive modeling could provide foresight into future customer behaviors and market trends.
- The study does not account for external factors like economic conditions, cultural trends, or competitive actions that can significantly influence customer behavior.
Integrating Advanced Methodologies
- Integrating qualitative research methods such as interviews, focus groups, or surveys could offer invaluable insights into customer attitudes, motivations, and satisfaction levels
- Implementing advanced predictive analytics and machine learning models could forecast future customer behaviors and market trends, aiding in proactive decision-making.
- Considering the impact of external factors such as economic shifts, cultural changes, and competitive landscape dynamics to understand their influence on customer behavior.
- Detailed mapping of customer journeys for each segment can provide deeper insights into various touchpoints and opportunities for enhancing customer experience.
- Feature engineering to capture the success rates and themes of previous marketing campaigns.
- Using other clustering methods and preprocessing techniques.
References
Patel, A. (2021, August 23). Customer Personality Analysis. Kaggle. https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis